Analysis of top 1000 HiveMC Bedwars players

Contents

  • Introduction: HiveMC and Bedwars
  • Data description and objectives
  • Data acquisition, manipulation and validation
  • Data analysis and visualization
  • Conclusion

1. Introduction: HiveMC and Bedwars

HiveMC is an official Minecraft Java Edition server. Minecraft is a sandbox video game developed by Mojang Studios. The game was created by Markus "Notch" Persson in the Java programming language and released as a public alpha for personal computers in 2009. It was officially released in November 2011, with Jens Bergensten taking over development. Minecraft Java Edition has now sold more than 30 million copies.

HiveMC, or The Hive, is a minigames server owned by Hive Games Limited. It was first registered in 2018. It focuses on providing players with a variety of fun games such as SkyWars, Bedwars, Hide and Seek and DeathRun. Over 12 million unique Minecraft accounts have visited the server at least once.

Bedwars is a strategy-based, player-versus-player minigame in which you must protect your bed while trying to eliminate your opponents on islands in the sky. You can continue to respawn while your bed is intact. Once your bed is destroyed you can no longer respawn, and you are eliminated the next time you die. Before it was released on multiplayer servers, it was a custom game map, developed in 2012, meant to be played with a group of friends. At the time it was called "Rush" and was developed by Xisuma. It became very popular once it was released on GommeHD, a German minigames server, where the name was changed to "Bedwars".

As of 2020, Bedwars is one of the most popular Minecraft minigames. The average number of unique players per day for this game mode is over 10,000. I also play Bedwars, and as of this date I am in 492nd place among all 3.6 million Bedwars players on The Hive.

This project analyzes the best 1000 Bedwars players on The Hive.

2. Data description and objectives

The data consists of player statistics, namely the indicators of a player's progress in the game. In this analysis we are interested in comparing those statistics among different players. Later in the project we will also compare the countries the players come from. We took a sample of the top 1000 players because those players have invested a lot of time to get there. This also helps to avoid counting several accounts of the same player, which would bias the statistics.

Our analysis is based on data from 01.10.2020 obtained from the HiveMC API (https://api.hivemc.com/). We additionally scrape each player's country. Below are the main attributes of the data that will be obtained, scraped and used in our analysis:

  • Place - rank of the player based on game points
  • Points - number of game points* the player has earned
  • Games played - number of games the player has participated in
  • Victories - number of wins the player has
  • Kills - total number of other players the player has killed
  • Deaths - total number of times the player has died in game
  • Beds - number of beds the player has personally destroyed
  • Teams eliminated - number of other teams eliminated
  • Winstreak - current number of wins in a row
  • Country - origin country of the player
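The attributes above can be pictured as one record per player. The following is a minimal sketch, not part of the original pipeline: the class name `PlayerRecord` and the helper `from_leaderboard_entry` are hypothetical, while the JSON key names are the ones read by the scraping code in section 3.1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlayerRecord:
    """One row of the data set (illustrative field names)."""
    place: int
    username: str
    points: int
    victories: int
    games: int
    kills: int
    deaths: int
    beds: int
    teams_eliminated: int
    winstreak: int
    country: Optional[str] = None  # None when the country is unknown


def from_leaderboard_entry(entry: dict, place: int,
                           country: Optional[str] = None) -> PlayerRecord:
    """Map one leaderboard JSON object to a PlayerRecord."""
    return PlayerRecord(
        place=place,
        username=entry["username"],
        points=entry["total_points"],
        victories=entry["victories"],
        games=entry["games_played"],
        kills=entry["kills"],
        deaths=entry["deaths"],
        beds=entry["beds_destroyed"],
        teams_eliminated=entry["teams_eliminated"],
        winstreak=entry["win_streak"],
        country=country,
    )


# Values of the first-placed player, as they appear in the data set.
sample_entry = {
    "username": "StrafeYosef", "total_points": 6197520, "victories": 15683,
    "games_played": 22251, "kills": 152661, "deaths": 19118,
    "beds_destroyed": 22776, "teams_eliminated": 22080, "win_streak": 1,
}
record = from_leaderboard_entry(sample_entry, place=1, country="Israel")
print(record)
```

Keeping the country optional mirrors the fact that it comes from a second source and may be missing.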

For this project, the data analysis and visualization consists of 5 parts:

  1. Analyze kill/death ratio (KDR), win/loss ratio (WLR) of top 1000 players
  2. Analyze relationship between variables and place of the player in the data set
  3. Analyze the winstreak, relationship between winstreak and KDR, WLR
  4. Propose new ranking model based on average points per game
  5. Analyze geographical tendencies among players' origin countries

* Points can be obtained by killing players (5 points), breaking beds (50-80 points) and upgrading resource generators (5-15 points).

3. Data acquisition, manipulation and validation

3.1. Data acquisition: using the HiveMC API and scraping the NameMC website

We use the requests and BeautifulSoup packages to get the data from the HiveMC API and NameMC. To get our data we need to make 5 API requests and scrape 1000 pages. The algorithm works as follows: first we get the data for the first 200 players on the leaderboard from the HiveMC API as a JSON string and parse it into Python objects. Using the usernames, we then visit the corresponding 200 NameMC profile pages to scrape each player's country. Once obtained, all of the data is written to a CSV file for further use. The process is repeated for the remaining 4 API requests. The web scraping took about 4 hours.

Source of the scraped data: https://ru.namemc.com/
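The paging over the leaderboard can be sketched separately from the network calls. This is a minimal sketch assuming the URL pattern used in the scraping code (start and end index appended to the leaderboard path); the function name `leaderboard_urls` is hypothetical.

```python
BASE_URL = "https://api.hivemc.com/v1/game/BED/leaderboard/"


def leaderboard_urls(total=1000, page_size=200):
    """Yield one URL per API request, covering the first `total` places."""
    for start in range(0, total, page_size):
        end = min(start + page_size, total)
        yield f"{BASE_URL}{start}/{end}"


urls = list(leaderboard_urls())
for url in urls:
    print(url)
```

With the defaults this produces the 5 requests of 200 players each that are described above.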

In [1]:
#import all the needed packages
import requests
from bs4 import BeautifulSoup
import json
import time
import csv
import numpy as np

# Import matplotlib
import matplotlib.pyplot as plt

# Import plotly
import plotly.express as px 
import plotly.graph_objects as go
from plotly.subplots import make_subplots

The next code cell is commented out because it takes too much time to run and overwrites the existing file. You can run it yourself if you would like to. The code is also included in the GitHub repo. Changing the proxy after every 5 requests to NameMC could also speed up the process without losing data.

In [2]:
#The Scrap took about 4 hours due to need in delay
# with open('hive_players.csv', mode='w') as hive_players:
#     fieldnames = ['Place', 'Username', 'Points' , 'Victories', 'Games', 'Kills', 'Deaths', 'Beds', 'Teams Eliminated', 'Winstreak', 'Country']
#     player = csv.DictWriter(hive_players, fieldnames=fieldnames)
#     player.writeheader()
#     for j in range (0, 5): #j itterates through api requests
#         #we need to go through 5 api requests as max range of data that can be retrieved per 1 request is 200.
        
#         k = j*200 #this will be the start place of the player
#         m = k+200 #end point in request
#         URL = 'https://api.hivemc.com/v1/game/BED/leaderboard/' 
#         URL = URL + str(k) + '/' + str(m)

#         page = requests.get(URL)

#         y = page.json() #the API returns JSON, so the response can be parsed directly
#         f = 0 
#         for i in range(0, 200):
#             #the id each api request is updated
#             if f == 5:
#                 #that's needed since NameMC allows only a limited number of requests in a given time window
#                 time.sleep(60)
#                 f = 0
#             f += 1 #count this NameMC request, otherwise the pause above never triggers
#             URL2 = 'https://ru.namemc.com/profile/' + y["leaderboard"][i]["username"] # 

#             page2 = requests.get(URL2)

#             sup = BeautifulSoup(page2.text, 'html.parser')
#             country = None
#             #path to needed html element 
#             a1 = sup.find_all('div', class_ ='row')
#             if len(a1)>=2 : #a1[1] is read below, so at least two rows are needed
#             #just to verify that we are on the right track
#                 a2 = a1[1].find_all('div', class_='col-lg-8')
#                 if len(a2)>=1 :
#                     a3 = a2[0].find_all('div', class_='card mb-3')

#                     if len(a3)>=3 :
#                         a4 = a3[2].find_all('div', class_='card-body py-1')
#                         if len(a4)>=1 :
#                             a5 = a4[0].find_all('div', class_='row')
#                             if len(a5)>=1 :
#                                 for itter in range (0, len(a5)) : #itterating through this card to find right element
#                                     if a5[itter].find('div', class_='col-md-4').text == "Страна" : #choosing the right element; "Страна" is Russian for "Country" (the Russian-language site is scraped)
#                                         country = a5[itter].find('div', class_='col-auto').text

#             #writing data to csv file...
#             player.writerow({'Place' : j*200 + i + 1, 'Username' : y["leaderboard"][i]["username"], 'Points' : y["leaderboard"][i]["total_points"], 'Victories' : y["leaderboard"][i]["victories"], 'Games' : y["leaderboard"][i]["games_played"], 'Kills' : y["leaderboard"][i]["kills"], 'Deaths' : y["leaderboard"][i]["deaths"], 'Beds' : y["leaderboard"][i]["beds_destroyed"], 'Teams Eliminated' : y["leaderboard"][i]["teams_eliminated"], 'Winstreak' : y["leaderboard"][i]["win_streak"], 'Country' : country})
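The delay logic of the scraping cell can be isolated into a small helper. The sketch below assumes the same policy as above (a 60 second pause after every 5 profile requests); the class name `BatchThrottle` is hypothetical, and the sleep function is injectable so that the example finishes instantly.

```python
import time


class BatchThrottle:
    """Pause for `pause_seconds` after every `batch_size` calls to tick()."""

    def __init__(self, batch_size=5, pause_seconds=60, sleep=time.sleep):
        self.batch_size = batch_size
        self.pause_seconds = pause_seconds
        self._sleep = sleep
        self._count = 0

    def tick(self):
        """Register one request and sleep once the batch is full."""
        self._count += 1
        if self._count >= self.batch_size:
            self._sleep(self.pause_seconds)
            self._count = 0


# Demonstration with a fake sleep that only records the requested pauses.
pauses = []
throttle = BatchThrottle(batch_size=5, pause_seconds=60, sleep=pauses.append)
for _ in range(12):
    throttle.tick()  # in the scraper this would follow each profile request
print(pauses)
```

At this rate, 1000 profile pages need about 200 pauses of one minute each, which is consistent with the reported run time of several hours.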
       
In [3]:
#Now in the pandas form:
import pandas as pd

table = pd.read_csv('hive_players.csv')
table[:10]
Out[3]:
   Place  Username          Points   Victories  Games  Kills   Deaths  Beds   Teams Eliminated  Winstreak  Country
0  1      StrafeYosef       6197520  15683      22251  152661  19118   22776  22080             1          Israel
1  2      prinsese1         5859040  16208      19877  95197   35171   16855  20461             0          None
2  3      Lcea              5224840  14517      20880  127275  44404   16311  13458             4          Germany
3  4      HappyStateOfMind  4812445  13889      18162  79314   35140   6719   6660              28         Norway
4  5      Taivax            4392645  10855      11433  91210   12997   25977  23505             80         United Kingdom
5  6      Dragasdata        4286425  11087      15368  80086   28333   26353  20198             2          None
6  7      xoLarry2Pro       4285105  10884      13333  95093   23854   14255  12560             24         France
7  8      Foony             4255195  11193      11382  78566   12488   14652  14969             420        Netherlands
8  9      CRYBL0CKER        4241915  11438      14361  90960   25247   13085  13814             29         South Africa
9  10     Kylic             4231945  11016      14150  101989  36658   14707  13627             29         United Kingdom

3.2. Data manipulation: cleaning and shaping

At this step we need to reshape our data a bit and assign appropriate column names.

  • Replace missing values with NaN
  • Change the type of some columns from string to int
  • Rescale the Points column to thousands
  • Determine how many values are missing
In [4]:
#Missed values are given by "None" so we need to replace them
table.replace("None", np.nan, inplace = True)

# some columns with numeric values should be converted to int type
table = table.astype({"Points": "int", "Victories": "int", "Games": "int", "Kills": "int", "Deaths": "int", "Beds": "int", "Teams Eliminated": "int", "Winstreak": "int"})
table.head(5)
Out[4]:
Place Username Points Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
0 1 StrafeYosef 6197520 15683 22251 152661 19118 22776 22080 1 Israel
1 2 prinsese1 5859040 16208 19877 95197 35171 16855 20461 0 NaN
2 3 Lcea 5224840 14517 20880 127275 44404 16311 13458 4 Germany
3 4 HappyStateOfMind 4812445 13889 18162 79314 35140 6719 6660 28 Norway
4 5 Taivax 4392645 10855 11433 91210 12997 25977 23505 80 United Kingdom
In [5]:
table.tail(5)
Out[5]:
Place Username Points Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
995 996 ThunderGamingX 510225 1134 2513 12852 3238 3702 3184 3 NaN
996 997 Kuadro 510135 1332 3428 10317 7530 1254 1023 0 NaN
997 998 swordfish09304 509375 1347 2727 11513 8708 1377 1151 2 NaN
998 999 LaMinecraftienne 509185 1434 2265 11413 5499 2792 1831 2 NaN
999 1000 0hSora 508290 1112 2626 12072 11361 1818 1529 0 NaN
In [6]:
#Number of points is pretty large, more than 500 thousands for each player
#We need to convert this column, lets divide by thousand and round to 1 decimal place 

points = table["Points"] #save that for later

table["Points"] = table["Points"].div(1000).round(1)
table.head(5)
Out[6]:
Place Username Points Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
0 1 StrafeYosef 6197.5 15683 22251 152661 19118 22776 22080 1 Israel
1 2 prinsese1 5859.0 16208 19877 95197 35171 16855 20461 0 NaN
2 3 Lcea 5224.8 14517 20880 127275 44404 16311 13458 4 Germany
3 4 HappyStateOfMind 4812.4 13889 18162 79314 35140 6719 6660 28 Norway
4 5 Taivax 4392.6 10855 11433 91210 12997 25977 23505 80 United Kingdom
In [7]:
#Now we need to rename Points column so that it represents correct info
table.rename(columns = {'Points' : 'Points (in thousands)'}, inplace = True)
table.head(5)
Out[7]:
Place Username Points (in thousands) Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
0 1 StrafeYosef 6197.5 15683 22251 152661 19118 22776 22080 1 Israel
1 2 prinsese1 5859.0 16208 19877 95197 35171 16855 20461 0 NaN
2 3 Lcea 5224.8 14517 20880 127275 44404 16311 13458 4 Germany
3 4 HappyStateOfMind 4812.4 13889 18162 79314 35140 6719 6660 28 Norway
4 5 Taivax 4392.6 10855 11433 91210 12997 25977 23505 80 United Kingdom
In [8]:
#Lets determine how many values are missing
table_validation = pd.DataFrame()
table_validation["Columns"] = list(table.columns)
table_validation["Count"] = list(table.count())
table_validation[:]
Out[8]:
    Columns                Count
0   Place                  1000
1   Username               1000
2   Points (in thousands)  1000
3   Victories              1000
4   Games                  1000
5   Kills                  1000
6   Deaths                 1000
7   Beds                   1000
8   Teams Eliminated       1000
9   Winstreak              1000
10  Country                456
In [9]:
plt.figure(figsize=(18,6))

not_enough = [i if i < 1000 else 0 for i in table_validation["Count"]]
plt.bar(table_validation["Columns"], table_validation["Count"])
plt.bar(table_validation["Columns"], not_enough, color = 'gray')#color to indicate those columns where there is not enough data
plt.xticks(rotation = 90, fontsize = 12)
plt.show()

So the only missing values are the countries of some players, either because they are not registered on NameMC or because they do not share their country. Entries without a country make up the majority of the data set: 544 of the 1000 players (54.4%). We will not use them in the geographical analysis later on.
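The same count can be cross-checked without pandas. This is a small sketch on three rows taken from the table above; the function name `count_present` and the set of markers treated as missing are assumptions.

```python
import csv
import io

MISSING = {"", "None"}  # markers treated as missing (assumption)


def count_present(csv_text):
    """Return {column: number of non-missing values} for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value not in MISSING:
                counts[name] += 1
    return counts


sample = """Place,Username,Country
1,StrafeYosef,Israel
2,prinsese1,None
3,Lcea,Germany
"""
counts = count_present(sample)
print(counts)
```

Applied to the full file, the Country entry would be expected to match the 456 reported by the validation table.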

4. Data analysis and visualization

4.1 Q1: Analyze kill/death ratio (KDR), win/loss ratio (WLR) of top 1000 players

Before we begin analyzing the ratios, let's look at the plots of kills and deaths individually.

In [10]:
plt.figure(figsize=(18,6))

#creating plot for kills
plt.subplot(121) 
plt.plot(table['Kills'])
plt.xlabel('Place')
plt.ylabel('Kill count')
plt.title('Figure 4.1 Number of kills of the player', fontsize = 16)

#creating plot for deaths
plt.subplot(122)
plt.plot(table['Deaths'])
plt.xlabel('Place')
plt.ylabel('Death count')
plt.title('Figure 4.2 Number of deaths of the player', fontsize = 16)
plt.show()

Now let's build a kill/death ratio (KDR) table for further analysis.

In [11]:
kd_rt = pd.DataFrame() #separate dataframe to see the table
kd_rt['Place'] = table['Place'] #we use place, username of the player
kd_rt['Username'] = table['Username']
kd_rt['Kills/Deaths'] = table['Kills'].div(table['Deaths'])
kd_rt.head(10)
Out[11]:
Place Username Kills/Deaths
0 1 StrafeYosef 7.985197
1 2 prinsese1 2.706690
2 3 Lcea 2.866296
3 4 HappyStateOfMind 2.257086
4 5 Taivax 7.017773
5 6 Dragasdata 2.826598
6 7 xoLarry2Pro 3.986459
7 8 Foony 6.291320
8 9 CRYBL0CKER 3.602804
9 10 Kylic 2.782176
In [12]:
kd_rt['Kills/Deaths'].describe() #some descriptive statistics
Out[12]:
count    1000.000000
mean        2.336954
std         1.484046
min         0.349944
25%         1.374774
50%         1.942250
75%         2.849133
max        15.760882
Name: Kills/Deaths, dtype: float64
In [13]:
kd_rt.sort_values(by=['Kills/Deaths'], ascending = False) #sort to see the best and worst players
Out[13]:
Place Username Kills/Deaths
642 643 LOWIQQ 15.760882
323 324 turtlelord66 9.879791
139 140 Vadanite 9.444688
117 118 THE_MAN0012 9.433173
633 634 RoccoDev 9.306017
... ... ... ...
555 556 bloodybabies 0.471788
613 614 Jilana 0.465522
421 422 ycvn 0.463940
975 976 BbyTiger 0.418229
321 322 _XaviWxlf 0.349944

1000 rows × 3 columns

In [14]:
plt.figure(figsize=(12,6)) #creating a histogram 
plt.hist(kd_rt['Kills/Deaths'], 100, density=1, facecolor='b', alpha=0.75)
plt.xlabel('Kills/Deaths')
plt.ylabel('Probability density') #density=1 normalizes the histogram area to 1
plt.title('Figure 4.3 Histogram of Kills/Deaths ratio' , fontsize = 16)
plt.text(11, .45, r'$\mu=2.337,\ \sigma=1.484$', fontsize = 16)
plt.grid(True)
plt.show()
In [15]:
print('The value of skewness is: '+ str(kd_rt['Kills/Deaths'].skew().round(3))) #to calculate value of skewness
The value of skewness is: 2.368

We can clearly see from Figure 4.3 and the skewness value that the Kills/Deaths ratio is positively skewed. The player with the best KDR is LOWIQQ and the player with the worst is _XaviWxlf.
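To make the skewness value less of a black box, here is a plain-Python sketch of the adjusted Fisher-Pearson sample skewness. We assume this is the bias-corrected variant that pandas reports; the function name `sample_skewness` and the toy data are illustrative only.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness G1 (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5                            # biased skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # bias correction


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 15]  # one large value stretches the right tail
print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A symmetric sample gives a value of zero, while a long right tail, as in the KDR histogram, gives a positive value.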

Now let's do the same for wins and losses, using the wins/losses ratio (WLR).

In [16]:
plt.figure(figsize=(18,6))

#first plot
plt.subplot(121)
plt.plot(table['Victories'], color ='g')
plt.xlabel('Place')
plt.ylabel('Win count')
plt.title('Figure 4.4 Number of wins of the player', fontsize = 16)

#second plot
losses = table['Games'].sub(table['Victories'], axis = 0)
plt.subplot(122)
plt.plot(losses, color ='g')
plt.xlabel('Place')
plt.ylabel('Loss count')
plt.title('Figure 4.5 Number of losses of the player', fontsize = 16)
plt.show()
In [17]:
wl_rt = pd.DataFrame() #separate dataframe creation
wl_rt['Place'] = table['Place']
wl_rt['Username'] = table['Username']
wl_rt['Wins/Losses'] = table['Victories'].div(losses)
wl_rt.head(10)
Out[17]:
Place Username Wins/Losses
0 1 StrafeYosef 2.387789
1 2 prinsese1 4.417552
2 3 Lcea 2.281471
3 4 HappyStateOfMind 3.250410
4 5 Taivax 18.780277
5 6 Dragasdata 2.589815
6 7 xoLarry2Pro 4.444263
7 8 Foony 59.222222
8 9 CRYBL0CKER 3.913103
9 10 Kylic 3.514997
In [18]:
wl_rt['Wins/Losses'].describe() #descriptive statistics 
Out[18]:
count    1000.000000
mean        3.138662
std         9.539045
min        -1.153784
25%         0.891796
50%         1.488284
75%         2.675207
max       183.600000
Name: Wins/Losses, dtype: float64
In [19]:
wl_rt.sort_values(by=['Wins/Losses'], ascending = False) #best and worst players...
Out[19]:
Place Username Wins/Losses
599 600 tastydish 183.600000
171 172 Hwamzx 151.925926
488 489 AzzifyBestWW 106.869565
210 211 Shaidyn 77.500000
7 8 Foony 59.222222
... ... ... ...
799 800 empix 0.222580
479 480 i1ii 0.209360
838 839 Oh_Srry 0.198855
975 976 BbyTiger 0.062513
296 297 TryMeHacker -1.153784

1000 rows × 3 columns

In [20]:
plt.figure(figsize=(12,6)) #creating a histogram
plt.hist(wl_rt['Wins/Losses'], 100, density=1, facecolor='g', alpha=0.75)
plt.xlabel('Wins/Losses')
plt.ylabel('Probability density') #density=1 normalizes the histogram area to 1
plt.title('Figure 4.6 Histogram of Wins/Losses ratio' , fontsize = 16)
plt.text(125, .25, r'$\mu=3.139,\ \sigma=9.539$', fontsize = 16)
plt.grid(True)
plt.show()
In [21]:
print('The value of skewness is: '+ str(wl_rt['Wins/Losses'].skew().round(3)))
The value of skewness is: 13.008

It is clear that the win/loss ratio is positively skewed. We can also see something irregular: the player 'TryMeHacker' has a negative WLR, which should be impossible. Since losses are computed as Games minus Victories, a negative ratio means that this player has more recorded victories than games played. That is due to an in-game bug (already fixed) that the player used to obtain an impossible number of wins. We should exclude this player from further analysis.

The player with the highest WLR is tastydish. BbyTiger has the lowest valid WLR.
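The negative ratio follows directly from how the WLR is defined. The sketch below restates the definition and adds a simple consistency check; the function names and the corrupted example record are hypothetical, while the first call uses the Victories and Games of the first-placed player.

```python
def win_loss_ratio(victories, games):
    """WLR = victories / (games - victories)."""
    losses = games - victories
    if losses == 0:
        return float("inf")  # a player without any loss has no finite ratio
    return victories / losses


def is_consistent(victories, games):
    """A valid record cannot contain more victories than games played."""
    return 0 <= victories <= games


print(win_loss_ratio(15683, 22251))  # first-placed player in the data set
print(win_loss_ratio(10, 5))         # hypothetical corrupted record
print(is_consistent(10, 5))
```

Filtering on a check like `is_consistent` would flag such records by their cause rather than by username.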

In [22]:
table = table[table['Username'] != 'TryMeHacker'] #excluding TryMeHacker from the table
kd_rt = kd_rt[kd_rt['Username'] != 'TryMeHacker']
wl_rt = wl_rt[wl_rt['Username'] != 'TryMeHacker']
table[295:297]
Out[22]:
Place Username Points (in thousands) Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
295 296 AI5U 1203.1 3404 3496 19444 3664 3143 3087 17 NaN
297 298 xxAri 1202.8 3331 6086 16175 13811 3088 2079 5 United States

Next we will analyze the correlation between the two ratios.

In [23]:
kd_wl = pd.concat([kd_rt, wl_rt['Wins/Losses']], axis = 1, join='inner')
kd_wl[:]
Out[23]:
Place Username Kills/Deaths Wins/Losses
0 1 StrafeYosef 7.985197 2.387789
1 2 prinsese1 2.706690 4.417552
2 3 Lcea 2.866296 2.281471
3 4 HappyStateOfMind 2.257086 3.250410
4 5 Taivax 7.017773 18.780277
... ... ... ... ...
995 996 ThunderGamingX 3.969117 0.822335
996 997 Kuadro 1.370120 0.635496
997 998 swordfish09304 1.322118 0.976087
998 999 LaMinecraftienne 2.075468 1.725632
999 1000 0hSora 1.062583 0.734478

999 rows × 4 columns

In [24]:
plt.figure(figsize=(8,8))
x1 = kd_wl['Kills/Deaths']
y1 = kd_wl['Wins/Losses']
plt.scatter(x1, y1, label=f'Correlation coefficient = {np.round(np.corrcoef(x1,y1)[0,1], 3)}') #correlation coef for data

plt.title('Figure 4.7 Correlation between Kills/Deaths and Wins/Losses')
plt.xlabel('Kills/Deaths')
plt.ylabel('Wins/Losses')
plt.legend(prop={'size': 11})
plt.show()

Figure 4.7 shows that there is a low positive linear correlation between KDR and WLR.
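For reference, the coefficient in the legend is the Pearson correlation. The following sketch computes it by hand for the ten best-placed players only, using the values from the KDR and WLR tables above, so the result is not the coefficient of the full sample; the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# KDR and WLR of places 1-10, copied from the tables above.
kdr = [7.985197, 2.706690, 2.866296, 2.257086, 7.017773,
       2.826598, 3.986459, 6.291320, 3.602804, 2.782176]
wlr = [2.387789, 4.417552, 2.281471, 3.250410, 18.780277,
       2.589815, 4.444263, 59.222222, 3.913103, 3.514997]
r_top10 = pearson_r(kdr, wlr)
print(round(r_top10, 3))
```

Because the coefficient is sensitive to extreme values, a few players with a very high WLR can dominate it.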

Distribution analysis

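The descriptive statistics show that the mean WLR (3.14) is higher than the mean KDR (2.34), although the median WLR (1.49) is lower than the median KDR (1.94). The difference comes from the large number of WLR outliers: players with extreme WLR values, far above the mean. These outliers have a huge effect on the variance of the data (a standard deviation of 9.54 for WLR versus 1.48 for KDR).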

In general, kills, deaths and wins follow a roughly logarithmic trend with respect to the player's ranking by points, while losses do not. This might be because lost games contribute little towards in-game points.
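The effect of outliers on the mean can be illustrated with the WLR values of the ten best-placed players from the table above. This is only a small illustration, not a statistic of the full sample.

```python
from statistics import mean, median

# WLR of places 1-10, copied from the Wins/Losses table above.
wlr_top10 = [2.387789, 4.417552, 2.281471, 3.250410, 18.780277,
             2.589815, 4.444263, 59.222222, 3.913103, 3.514997]

# The same values without the single largest one.
without_max = sorted(wlr_top10)[:-1]

print(round(mean(wlr_top10), 2), round(median(wlr_top10), 2))
print(round(mean(without_max), 2), round(median(without_max), 2))
```

Removing one extreme value roughly halves the mean while the median barely moves, which is why the median is the more robust summary for WLR.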

4.2 Q2: Analyze relationship between variables and place of the player in the data set

In [25]:
table.head(10)
Out[25]:
Place Username Points (in thousands) Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
0 1 StrafeYosef 6197.5 15683 22251 152661 19118 22776 22080 1 Israel
1 2 prinsese1 5859.0 16208 19877 95197 35171 16855 20461 0 NaN
2 3 Lcea 5224.8 14517 20880 127275 44404 16311 13458 4 Germany
3 4 HappyStateOfMind 4812.4 13889 18162 79314 35140 6719 6660 28 Norway
4 5 Taivax 4392.6 10855 11433 91210 12997 25977 23505 80 United Kingdom
5 6 Dragasdata 4286.4 11087 15368 80086 28333 26353 20198 2 NaN
6 7 xoLarry2Pro 4285.1 10884 13333 95093 23854 14255 12560 24 France
7 8 Foony 4255.2 11193 11382 78566 12488 14652 14969 420 Netherlands
8 9 CRYBL0CKER 4241.9 11438 14361 90960 25247 13085 13814 29 South Africa
9 10 Kylic 4231.9 11016 14150 101989 36658 14707 13627 29 United Kingdom

In the following analysis we will look at the relationship between the place (rank) of the player and points, beds, teams eliminated and winstreak. We will not look at kills, deaths and victories, as we already discussed them in the previous part. We will also not look at games for now, as they are covered in another part.

In [26]:
#before we start it's better to divide the players by the rank groups
#slice ends are exclusive, so table[0:25] covers places 1-25
top25=table[0:25]
top100=table[25:100]
top250=table[100:250]
top1000=table[250:]
In [27]:
fig = plt.figure(figsize=(18,15))
fig.suptitle('Figure 4.8 Scatterplots, dependence on Ranking', fontsize=16)
ax1 = fig.add_subplot(421) #axes1
ax1.scatter(x=top25['Place'], 
            y=top25['Points (in thousands)'], 
           color = 'g')
ax1.scatter(x=top100['Place'], 
            y=top100['Points (in thousands)'], 
           color = 'y')
ax1.scatter(x=top250['Place'], 
            y=top250['Points (in thousands)'], 
           color = 'b')
ax1.scatter(x=top1000['Place'], 
            y=top1000['Points (in thousands)'], 
           color = 'r')
ax1.set_ylabel('Points (in thousands)', fontsize=15)
ax1.set_xlabel('Place', fontsize=15)

ax2 = fig.add_subplot(422) #axes2
ax2.scatter(x=top25['Place'], 
            y=top25['Beds'], 
           color = 'g')
ax2.scatter(x=top100['Place'], 
            y=top100['Beds'], 
           color = 'y')
ax2.scatter(x=top250['Place'], 
            y=top250['Beds'], 
           color = 'b')
ax2.scatter(x=top1000['Place'], 
            y=top1000['Beds'], 
           color = 'r')
ax2.set_ylabel('Beds', fontsize=15)
ax2.set_xlabel('Place', fontsize=15)

ax3 = fig.add_subplot(423) #axes3
ax3.scatter(x=top25['Place'], 
            y=top25['Teams Eliminated'], 
           color = 'g')
ax3.scatter(x=top100['Place'], 
            y=top100['Teams Eliminated'], 
           color = 'y')
ax3.scatter(x=top250['Place'], 
            y=top250['Teams Eliminated'], 
           color = 'b')
ax3.scatter(x=top1000['Place'], 
            y=top1000['Teams Eliminated'], 
           color = 'r')
ax3.set_ylabel('Teams Eliminated', fontsize=15)
ax3.set_xlabel('Place', fontsize=15)

ax4 = fig.add_subplot(424) #axes4
ax4.scatter(x=top25['Place'], 
            y=top25['Winstreak'], 
           color = 'g')
ax4.scatter(x=top100['Place'], 
            y=top100['Winstreak'], 
           color = 'y')
ax4.scatter(x=top250['Place'], 
            y=top250['Winstreak'], 
           color = 'b')
ax4.scatter(x=top1000['Place'], 
            y=top1000['Winstreak'], 
           color = 'r')
ax4.set_ylabel('Winstreak', fontsize=15)
ax4.set_xlabel('Place', fontsize=15)
fig.legend(['1-25 rank', '26-100 rank', '101-250 rank', '251-1000 rank'],  prop={'size': 15})
plt.show()

As we can see from Figure 4.8, points, beds and teams eliminated clearly follow a pattern close to a logarithmic curve, with points fitting the shape most closely. For every variable except winstreak, better-placed players tend to have higher values. We will look at the tendencies in winstreak later in the analysis.
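The four rank groups used for the colors can also be derived from the Place value itself, which avoids relying on row positions. This is a minimal sketch using the same group boundaries as the legend; the function name `rank_group` is hypothetical.

```python
from bisect import bisect_left

UPPER_BOUNDS = [25, 100, 250, 1000]                # inclusive upper place per group
LABELS = ["1-25", "26-100", "101-250", "251-1000"]


def rank_group(place):
    """Return the rank group label for a leaderboard place (1-1000)."""
    if not 1 <= place <= UPPER_BOUNDS[-1]:
        raise ValueError(f"place out of range: {place}")
    return LABELS[bisect_left(UPPER_BOUNDS, place)]


for place in (1, 25, 26, 100, 101, 250, 251, 1000):
    print(place, rank_group(place))
```

The boundary places 25, 100 and 250 are the ones most easily lost when the groups are built by slicing.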

4.3 Q3: Analyze the winstreak, relationship between winstreak and KDR, WLR

Not all of the players in the top 1000 have a winstreak, so for this part we will look only at those players who have a winstreak of at least 1.

In [28]:
table.sort_values(by = ['Winstreak'], ascending = False).head(10) #just sorted
Out[28]:
Place Username Points (in thousands) Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
171 172 Hwamzx 1588.9 4102 4129 24818 2756 5166 5464 526 NaN
488 489 AzzifyBestWW 872.2 2458 2481 15804 2832 2452 2262 521 NaN
190 191 Pighs 1513.8 3965 4318 27630 6383 5319 5644 436 NaN
7 8 Foony 4255.2 11193 11382 78566 12488 14652 14969 420 Netherlands
599 600 tastydish 754.3 1836 1846 11081 1510 2747 2457 229 NaN
160 161 PigJesus 1626.8 4482 4983 34619 6624 4654 4011 197 NaN
139 140 Vadanite 1727.4 4859 5124 30223 3200 7261 6698 137 NaN
143 144 Juulis 1697.8 4485 6072 33954 9797 5356 4481 106 Russia
210 211 Shaidyn 1443.4 3875 3925 28159 4266 8056 6985 101 United Kingdom
180 181 Moibus 1557.3 4286 6669 33151 13646 3471 4796 86 Russia
In [29]:
table.loc[table['Winstreak']>0].sort_values(by = ['Winstreak'], ascending = False) #use this to determine how many players have winstreak
Out[29]:
Place Username Points (in thousands) Victories Games Kills Deaths Beds Teams Eliminated Winstreak Country
171 172 Hwamzx 1588.9 4102 4129 24818 2756 5166 5464 526 NaN
488 489 AzzifyBestWW 872.2 2458 2481 15804 2832 2452 2262 521 NaN
190 191 Pighs 1513.8 3965 4318 27630 6383 5319 5644 436 NaN
7 8 Foony 4255.2 11193 11382 78566 12488 14652 14969 420 Netherlands
599 600 tastydish 754.3 1836 1846 11081 1510 2747 2457 229 NaN
... ... ... ... ... ... ... ... ... ... ... ...
516 517 23week_Waza 836.0 1918 2289 22945 5290 4603 4175 1 United States
524 525 Gcbriel 828.3 2065 3982 20136 9026 2877 2267 1 Poland
531 532 0hStormi 821.2 2248 4531 19906 9672 1988 1736 1 NaN
533 534 Hadst 815.8 1731 4282 21347 8774 3633 3746 1 NaN
402 403 Black_Sheepy 989.8 2328 5534 25851 6738 8520 7578 1 United States

479 rows × 11 columns

So we can see that 479 of the 999 remaining players have a winstreak greater than 0.

In [30]:
table['Winstreak'].describe() #descriptive statistics of winstreak of all players
Out[30]:
count    999.000000
mean       6.573574
std       33.510211
min        0.000000
25%        0.000000
50%        0.000000
75%        3.000000
max      526.000000
Name: Winstreak, dtype: float64
In [31]:
table.loc[table['Winstreak']>0]['Winstreak'].describe() #descriptive statistics of winstreak of players that have a winstreak
Out[31]:
count    479.000000
mean      13.709812
std       47.397206
min        1.000000
25%        1.000000
50%        3.000000
75%        9.000000
max      526.000000
Name: Winstreak, dtype: float64

Because the winstreak has a huge standard deviation and a lot of extremely large values, we will plot a histogram of the base-2 logarithm of the winstreak for the players who have one.

In [32]:
#create a dataframe for scaled values
lg_ws = pd.DataFrame()
lg_ws['Place'] = table.loc[table['Winstreak']>0]['Place']
lg_ws['Username'] = table.loc[table['Winstreak']>0]['Username']
lg_ws['Log winstreak'] = np.log2(table.loc[table['Winstreak']>0]['Winstreak']).round(3) #scaling by log2
lg_ws.sort_values(by = ['Log winstreak'], ascending = False)[:]
Out[32]:
Place Username Log winstreak
171 172 Hwamzx 9.039
488 489 AzzifyBestWW 9.025
190 191 Pighs 8.768
7 8 Foony 8.714
599 600 tastydish 7.839
... ... ... ...
516 517 23week_Waza 0.000
524 525 Gcbriel 0.000
531 532 0hStormi 0.000
533 534 Hadst 0.000
402 403 Black_Sheepy 0.000

479 rows × 3 columns
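The log-scaled values in the table above can be checked by hand. The sketch below applies the same base-2 logarithm to a few winstreaks taken from the data; the helper name `log2_winstreak` is illustrative.

```python
import math


def log2_winstreak(winstreak):
    """Base-2 logarithm of a positive winstreak, rounded to 3 decimals."""
    return round(math.log2(winstreak), 3)


# Winstreaks of Hwamzx, AzzifyBestWW and a player with a winstreak of 1.
for w in (526, 521, 1):
    print(w, log2_winstreak(w))
```

A winstreak of 1 maps to 0 and every doubling adds 1, so the 526 game streak lands just above 9 instead of dwarfing every other value.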

In [33]:
plt.figure(figsize=(12,6)) #creating a histogram
plt.hist(lg_ws['Log winstreak'], 20, density=1, facecolor='g', alpha=0.75)
plt.xlabel('$log_2(Winstreak)$')
plt.ylabel('Probability density') #density=1 normalizes the histogram area to 1
plt.title('Figure 4.9 Histogram of winstreak (Log scaled)' , fontsize = 16)
plt.grid(True)
plt.show()

From Figure 4.9 we can see that the winstreak is positively skewed. The player with the longest winstreak goes by the username "Hwamzx".

In [34]:
lg_ws['KDR'] = kd_wl['Kills/Deaths'] #use kd_wl data frame we used while analyzing KDR and WLR to add columns to lg_ws
lg_ws['WLR'] = kd_wl['Wins/Losses']
lg_ws = lg_ws.dropna(axis=0) #to get rid of null values (dropna returns a new frame, so we reassign)

#create rank groups for colors
lg_ws.loc[lg_ws['Place'] <= 25, 'Rank group'] = '1-25'
lg_ws.loc[lg_ws['Place'] > 25, 'Rank group'] = '26-100'
lg_ws.loc[lg_ws['Place'] > 100, 'Rank group'] = '101-250'
lg_ws.loc[lg_ws['Place'] > 250, 'Rank group'] = '251-1000'
In [35]:
fig = px.scatter(lg_ws, y="Log winstreak", x="Place", color="Rank group",
                 hover_data=['Username'], title = 'Figure 4.10 Log Winstreak/Place')
fig.show()

From Figure 4.10 we can see that there is not much dependence between the place of the player and the player's winstreak.

In [36]:
fig = px.scatter(lg_ws, y="Log winstreak", x="KDR", color="Rank group",
                 hover_data=['Username'], title = 'Figure 4.11 Log Winstreak/KDR')
fig.show()

From Figure 4.11 we can see that a player with a good KDR is more likely to have a high winstreak.

In [37]:
fig = px.scatter(lg_ws, y="Log winstreak", x="WLR", color="Rank group",
                 hover_data=['Username'], title = 'Figure 4.12 Log Winstreak/WLR')
fig.show()

From Figure 4.12 we can see that a player with a good WLR is also more likely to have a high winstreak, but the effect is weaker than for KDR.

Distribution analysis

Most of the players, even though they are in the top 1000, do not have a high winstreak. However, some players have managed to get an extremely high winstreak. The data is clearly positively skewed.

The scatter plots show that there might be some dependence between WLR, KDR and winstreak, but not between winstreak and the place of the player.

4.4 Q4: Propose new ranking model based on average points per game

Ranking players by average points per game could be good practice, as it shows how efficient each player is in the game.

Before we begin, let's make a separate column for average points per game.

In [38]:
#call the column Pnt_game
#refer back to the points Series we saved for this moment (raw points, not thousands)
table['Pnt_game'] = points.div(table['Games']).round(2) #divide points by number of games
table[['Place', 'Username', 'Pnt_game']].head(10)
Out[38]:
Place Username Pnt_game
0 1 StrafeYosef 278.53
1 2 prinsese1 294.76
2 3 Lcea 250.23
3 4 HappyStateOfMind 264.97
4 5 Taivax 384.21
5 6 Dragasdata 278.92
6 7 xoLarry2Pro 321.39
7 8 Foony 373.85
8 9 CRYBL0CKER 295.38
9 10 Kylic 299.08
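As a quick check of the new column, the averages can be recomputed from the raw Points and Games values of a few players in the leaderboard table. The helper name `points_per_game` is illustrative.

```python
def points_per_game(points, games):
    """Average number of points earned per game, rounded to 2 decimals."""
    return round(points / games, 2)


# (username, total points, games played) copied from the leaderboard table.
players = [
    ("StrafeYosef", 6197520, 22251),
    ("prinsese1", 5859040, 19877),
    ("Taivax", 4392645, 11433),
]
for name, points, games in players:
    print(name, points_per_game(points, games))
```

The raw point totals are used rather than the rescaled "Points (in thousands)" column, since the latter was rounded to one decimal place.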
In [39]:
#now, lets sort the values
table[['Place', 'Username', 'Pnt_game']].sort_values(by = ['Pnt_game'], ascending = False)
Out[39]:
Place Username Pnt_game
599 600 tastydish 408.62
714 715 YDMYK 397.54
171 172 Hwamzx 384.81
4 5 Taivax 384.21
7 8 Foony 373.85
... ... ... ...
905 906 sagey2000 87.97
927 928 iixouds 85.63
775 776 BabyMelodyFox 85.41
838 839 Oh_Srry 82.67
975 976 BbyTiger 48.91

999 rows × 3 columns

In [40]:
#create rank groups for colors
table.loc[table['Place'] <= 25, 'Rank group'] = '1-25'
table.loc[table['Place'] > 25, 'Rank group'] = '26-100'
table.loc[table['Place'] > 100, 'Rank group'] = '101-250'
table.loc[table['Place'] > 250, 'Rank group'] = '251-1000'

#create boxplot
px.box(table, x = 'Pnt_game', hover_data=['Username', 'Place'],
       title = 'Figure 4.13 Box Plot for average points per game')

Figure 4.13 shows two outliers: the lower outlier 'BbyTiger' and the upper outlier 'tastydish'.

We can see that the distribution of average points per game over all players is almost symmetric.

50% of the players average fewer than 230 points per game.
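Box plots conventionally mark a point as an outlier when it lies more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule to toy data; the function name `tukey_outliers` is hypothetical, and the quartile method used here is an assumption that may differ slightly from the one Plotly applies.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
print(tukey_outliers(toy))
```

Under this rule a value is judged relative to the spread of the middle half of the data, not relative to the mean.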

In [41]:
#create a boxplots grouped by rank
px.box(table, x = 'Pnt_game', hover_data=['Username', 'Place'], color = 'Rank group',
       title = 'Figure 4.14 Box Plot for average points per game, grouped by rank')

Figure 4.14 displays some interesting results. Rank groups '1-25' and '251-1000' are both close to symmetric, whereas rank group '26-100' is positively skewed and rank group '101-250' is negatively skewed.

The only outliers belong to rank group '251-1000' (tastydish and YDMYK), and they are the two best players by average points per game. This could imply that the two accounts belong to experienced players who own several game accounts.

As expected, the better the rank group, the more points its players earn per game on average. The box plot shows this: the box for each better rank group lies further to the right than that of the group below it.
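
The comparison between rank groups can be backed by a small summary table. The sketch below is one possible approach on toy data, not the notebook's own method: it builds the same four groups with `pd.cut` instead of successive `.loc` assignments and then takes the median per group. The toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for `table` (hypothetical values)
toy = pd.DataFrame({
    "Place": [1, 25, 26, 100, 101, 250, 251, 1000],
    "Pnt_game": [300.0, 280.0, 270.0, 250.0, 240.0, 230.0, 220.0, 180.0],
})

# Right-closed bins: (0, 25], (25, 100], (100, 250], (250, 1000]
toy["Rank group"] = pd.cut(toy["Place"], bins=[0, 25, 100, 250, 1000],
                           labels=["1-25", "26-100", "101-250", "251-1000"])

# Median and mean of average points per game within each rank group
summary = toy.groupby("Rank group", observed=True)["Pnt_game"].agg(["median", "mean"])
print(summary)
```

Because `pd.cut` returns an ordered categorical, the groups keep their natural order in tables and plot legends.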

Now let's rank the players by average points per game.

In [42]:
#using rank() function we will add new column
table['Rank pt/g'] = table['Pnt_game'].rank(ascending = False)
table[['Username', 'Place', 'Rank pt/g']][:]
Out[42]:
Username Place Rank pt/g
0 StrafeYosef 1 221.0
1 prinsese1 2 149.0
2 Lcea 3 382.0
3 HappyStateOfMind 4 286.0
4 Taivax 5 4.0
... ... ... ...
995 ThunderGamingX 996 651.0
996 Kuadro 997 909.0
997 swordfish09304 998 740.0
998 LaMinecraftienne 999 525.0
999 0hSora 1000 703.0

999 rows × 3 columns

In [43]:
#creating a scatter plot to observe a correlation between two rankings
fig = px.scatter(table, y="Rank pt/g", x="Place", color="Rank group",
                 hover_data=['Username'], title = 'Figure 4.15 Scatter plot of two rankings')
fig.show()
In [44]:
#print correlation coefficient between two rankings
print('Correlation coefficient = ' + str(np.round(np.corrcoef(table['Rank pt/g'],table['Place'])[0,1], 3))) 
Correlation coefficient = 0.313

Although better rank groups tend to earn more points per game, for individual players the correlation between the place by total points and the rank we proposed is weak ($r = 0.313$).

Still, the more points per game a player averages, the faster that player will climb toward the high ranks. Ranking players by points per game could therefore be a good indicator of future trends in the rankings.
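
Since both columns being compared are ranks, the coefficient above is a rank correlation: when neither column has ties, the Pearson correlation of two rank columns equals Spearman's rank correlation. The sketch below illustrates this on hypothetical toy values. It also passes `method="min"` to `rank()` and casts to `int`, an optional choice that keeps the ranks as whole numbers rather than floats.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `table` (hypothetical values)
toy = pd.DataFrame({
    "Place": [1, 2, 3, 4, 5],
    "Pnt_game": [278.5, 294.8, 250.2, 384.2, 265.0],
})

# Rank 1 = highest average points per game; method="min" keeps ranks integral
toy["Rank pt/g"] = toy["Pnt_game"].rank(ascending=False, method="min").astype(int)

# Pearson correlation of the two rank columns
# (equal to Spearman's rank correlation here because there are no ties)
rho = np.corrcoef(toy["Rank pt/g"], toy["Place"])[0, 1]
print(toy)
print(round(rho, 3))
```

With tied values, `method="min"` and the default `method="average"` give different ranks, so the choice should be made deliberately.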

4.5 Q5: Analyze geographical tendencies among players' origin countries

Let's begin by creating a dataframe that shows the countries and the number of top 1000 players in each.

In [45]:
cntry = pd.DataFrame()

#group by to retrieve number of players in each country, count by usernames
cntry = table[['Country', 'Username']].groupby(['Country']).count()

#move index to column
cntry.reset_index(inplace=True)

#rename to fit the purpose
cntry = cntry.rename(columns={'Username':'Count'})
#sort by count of players
cntry = cntry.sort_values(by = ['Count'], ascending = False)
cntry[:]
Out[45]:
Country Count
61 United States 91
60 United Kingdom 83
12 France 33
13 Germany 30
5 Canada 19
... ... ...
29 Mexico 1
41 Qatar 1
30 Mongolia 1
31 Nauru 1
18 Iceland 1

68 rows × 2 columns

So we have data on players from 68 different countries that we know of. The United States has the most top 1000 players of any country.
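
The same per-country counts can be obtained more compactly, and it is worth checking how many players have no known country. The sketch below uses toy data and assumes that unknown countries are stored as missing values (NaN/None), which the notebook does not confirm.

```python
import pandas as pd

# Toy stand-in for `table` (hypothetical values); None marks an unknown country
toy = pd.DataFrame({
    "Username": ["a", "b", "c", "d", "e", "f"],
    "Country": ["United States", "United Kingdom", "United States", None, "France", None],
})

# Players per known country, largest first (missing values are skipped by default)
counts = toy["Country"].value_counts()

# Players whose country is missing and who are therefore absent from `counts`
n_unknown = int(toy["Country"].isna().sum())
print(counts)
print(n_unknown)
```

Reporting the number of unknowns alongside the table makes clear how much of the top 1000 the country comparison actually covers.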

In [46]:
#summing all of the points of players for each country
cntry2 = table[['Country', 'Points (in thousands)']].groupby(['Country']).sum()
cntry2 = cntry2.sort_values(by = ['Points (in thousands)'], ascending = False)
cntry2.reset_index(inplace=True)
#merging with previous table
cntry = pd.merge(cntry, cntry2, on = 'Country')
cntry[:]
Out[46]:
Country Count Points (in thousands)
0 United States 91 113425.2
1 United Kingdom 83 100296.6
2 France 33 38045.7
3 Germany 30 38528.5
4 Canada 19 21280.4
... ... ... ...
63 Mexico 1 784.1
64 Qatar 1 523.8
65 Mongolia 1 714.5
66 Nauru 1 615.6
67 Iceland 1 1415.8

68 rows × 3 columns

In [47]:
#for the analysis we will use average points per single player of a country
cntry['Points per player (in thousands)'] = cntry['Points (in thousands)'].div(cntry['Count']).round(3)

#also, rank them by points per player
cntry['Rank'] = cntry['Points per player (in thousands)'].rank(ascending = False)
cntry = cntry.sort_values(by = ['Rank'])
cntry[:]
Out[47]:
Country Count Points (in thousands) Points per player (in thousands) Rank
39 Czech Republic 2 6333.2 3166.600 1.0
22 Israel 4 11036.2 2759.050 2.0
43 Bangladesh 1 2732.1 2732.100 3.0
59 Sint Maarten 1 2591.3 2591.300 4.0
33 South Africa 2 4921.4 2460.700 5.0
... ... ... ... ... ...
57 Lebanon 1 577.5 577.500 64.0
49 Fiji 1 573.3 573.300 65.0
31 Slovakia 2 1128.8 564.400 66.0
25 Egypt 3 1645.6 548.533 67.0
64 Qatar 1 523.8 523.800 68.0

68 rows × 5 columns

In [48]:
#creating a geo scatter plot that shows number of players by size of dots, avg points per player by color
fig = px.scatter_geo(cntry, locations="Country", locationmode = 'country names', color="Points per player (in thousands)",
                     hover_name="Country", size = 'Count', size_max = 50,
                     projection="natural earth", hover_data = ['Rank', 'Points (in thousands)'],
                    color_continuous_scale=px.colors.sequential.Bluered, 
                    title = 'Figure 4.16 Scatter world map of countries and the points per player')

fig.show()

Figure 4.16 presents interactive information about how the players are distributed around the world. Clearly, most of the player base is located in Europe and North America, but there are also some players from Australia and Russia.

From the table we can see that the Czech Republic leads in average points per player. Israel and Bangladesh take second and third place, separated by a very thin margin (both at roughly 2,700 thousand points per player).

The average points per player ranking is dominated by countries with few players, because with only one or two players a single high scorer determines the country's average.
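
One way to reduce this small-sample effect is to rank only the countries that have at least some minimum number of players. The sketch below shows the idea on toy data. The threshold `MIN_PLAYERS` and all values are hypothetical and chosen only for illustration, not taken from the analysis.

```python
import pandas as pd

# Toy stand-in for `table` (hypothetical values)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "C", "C"],
    "Points (in thousands)": [1000.0, 1200.0, 1100.0, 3000.0, 900.0, 700.0],
})

# Count, total and mean points per country in one pass
per_country = (toy.groupby("Country")["Points (in thousands)"]
                  .agg(Count="count", Total="sum", Mean="mean"))

MIN_PLAYERS = 2  # hypothetical threshold, chosen only for illustration

# Keep countries with enough players, then order them by mean points per player
ranked = (per_country[per_country["Count"] >= MIN_PLAYERS]
          .sort_values("Mean", ascending=False))
print(ranked)
```

In this toy example country B has the highest mean but only one player, so it is dropped from the ranking. Any such threshold is a judgment call and trades coverage for stability.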

5. Conclusion

Based on the analysis above, we can conclude several things about tendencies among the top 1000 Hive Bedwars players.

  • Players tend to have a greater Win/Loss ratio than Kill/Death ratio. Also, there is very low correlation between the two ratios.
  • There is a positive relationship between a player's place (rank) and the number of points the player has, as well as the number of beds and teams the player has eliminated.
  • A player's winstreak doesn't depend on the place the player holds, but it could be affected by KDR and WLR.
  • Ranking players by average points per game could be useful when trying to predict the players' future standings by points. This is supported by the correlation between the rank groups and the distribution of average points per game within each group.
  • Most of the players come from North America and Europe; still, there are a lot of extremely good players in countries with quite a low number of players.